20 research outputs found

    An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

    Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics, which are generally neglected by existing evaluation methods, have a significant impact on the performance of imbalanced classifiers. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve, established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefits of improved minority class accuracy and the cost of reduced majority class accuracy. Subsequently, we analyze the impact of the imbalance ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced datasets reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically when the imbalance ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.
    National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104

    Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets

    In many real application areas, the data used are highly skewed and the number of instances for some classes is much higher than that of the other classes. Solving a classification task using such an imbalanced data-set is difficult due to the bias of the training towards the majority classes. The aim of this paper is to improve the performance of fuzzy rule based classification systems on imbalanced domains by increasing the granularity of the fuzzy partitions on the boundary areas between the classes, in order to obtain a better separability. We propose the use of a hierarchical fuzzy rule based classification system, which refines a simple linguistic fuzzy model by extending the structure of the knowledge base in a hierarchical way and applies a genetic rule selection process in order to get a compact and accurate model. The good performance of this approach is shown through an extensive experimental study carried out over a large collection of imbalanced data-sets.
    Spanish Ministry of Education and Science (MEC) under Projects TIN-2005-08386-C05-01 and TIN-2005-08386- C05-0

    An Analysis of the Rule Weights and Fuzzy Reasoning Methods for Linguistic Rule Based Classification Systems Applied to Problems with Highly Imbalanced Data Sets

    In this contribution, we carry out an analysis of the rule weights and Fuzzy Reasoning Methods for Fuzzy Rule Based Classification Systems in the framework of imbalanced data-sets with a high imbalance degree. We analyze the behaviour of the Fuzzy Rule Based Classification Systems, searching for the best configuration of rule weight and Fuzzy Reasoning Method, and also study the cooperation of some instance pre-processing methods. To do so, we use a simple rule base obtained with the Chi et al. method, which extends the well-known Wang and Mendel method to classification problems. The results obtained show the necessity of applying an instance preprocessing step and the clear differences in the use of the rule weight and Fuzzy Reasoning Method. Finally, it is empirically shown that Fuzzy Rule Based Classification Systems outperform the 1-NN and C4.5 classifiers in the framework of highly imbalanced data-sets.
    Spanish Projects TIN-2005-08386-C05-01 & TIC-2005-08386- C05-0

    Why Linguistic Fuzzy Rule Based Classification Systems perform well in Big Data Applications?

    The significance of addressing Big Data applications is beyond all doubt. The current ability to extract interesting knowledge from large volumes of information provides great advantages to both corporations and academia. Therefore, researchers and practitioners must deal with the problem of scalability so that Machine Learning and Data Mining algorithms can address Big Data properly. To this end, the MapReduce programming framework is by far the most widely used mechanism to implement fault-tolerant distributed applications. This framework implies the design of a divide-and-conquer mechanism in which local models are learned separately in one stage (Map tasks), whereas a second stage (Reduce) is devoted to aggregating all sub-models into a single solution. In this paper, we focus on the analysis of the behavior of Linguistic Fuzzy Rule Based Classification Systems when embedded into a MapReduce working procedure. By retrieving different information regarding the rules learned throughout the MapReduce process, we will be able to identify some of the capabilities of this particular paradigm that allow these systems to perform well when addressing Big Data problems. In summary, we will show that linguistic fuzzy classifiers are a robust approach when scalability is required.
    This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P and TIN2015-68454-R

    Multiplex Analysis of CircRNAs from Plasma Extracellular Vesicle-Enriched Samples for the Detection of Early-Stage Non-Small Cell Lung Cancer

    Background: The analysis of liquid biopsies brings new opportunities in the precision oncology field. In this context, extracellular vesicle circular RNAs (EV-circRNAs) have gained interest as biomarkers for lung cancer (LC) detection. However, standardized and robust protocols need to be developed to boost their potential in the clinical setting. Although nCounter has been used for the analysis of other liquid biopsy substrates and biomarkers, it has never been employed for EV-circRNA analysis of LC patients. Methods: EVs were isolated from early-stage LC patients (n = 36) and controls (n = 30). Different volumes of plasma, together with different numbers of preamplification cycles, were tested to reach the best nCounter outcome. Differential expression analysis of circRNAs was performed, along with the testing of different machine learning (ML) methods for the development of a prognostic signature for LC. Results: A combination of 500 μL of plasma input with 10 cycles of pre-amplification was selected for the rest of the study. Eight circRNAs were found upregulated in LC. Further ML analysis selected a 10-circRNA signature able to discriminate LC from controls with AUC ROC of 0.86. Conclusions: This study validates the use of the nCounter platform for multiplexed EV-circRNA expression studies in LC patient samples, allowing the development of prognostic signatures.
    European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant 76549

    IIVFDT: Ignorance Functions based Interval-Valued Fuzzy Decision Tree with Genetic Tuning

    The choice of membership functions plays an essential role in the success of fuzzy systems. This is a complex problem due to the possible lack of knowledge when assigning punctual values as membership degrees. To address this difficulty, we propose a methodology called Ignorance functions based Interval-Valued Fuzzy Decision Tree with genetic tuning, IIVFDT for short, which improves the performance of fuzzy decision trees by taking into account the ignorance degree. This ignorance degree is the result of a weak ignorance function applied to the punctual value set as membership degree. Our IIVFDT proposal is composed of four steps: (1) the base fuzzy decision tree is generated using the fuzzy ID3 algorithm; (2) the linguistic labels are modeled with Interval-Valued Fuzzy Sets. To do so, a new parametrized construction method of Interval-Valued Fuzzy Sets is defined, whose length represents such ignorance degree; (3) the fuzzy reasoning method is extended to work with this representation of the linguistic terms; (4) an evolutionary tuning step is applied for computing the optimal ignorance degree for each Interval-Valued Fuzzy Set. The experimental study shows that the IIVFDT method outperforms the initial fuzzy ID3, both with and without Interval-Valued Fuzzy Sets. The suitability of the proposed methodology is shown with respect to both several state-of-the-art fuzzy decision trees and C4.5. Furthermore, we analyze the quality of our approach versus two methods that learn the fuzzy decision tree using genetic algorithms. Finally, we show that a superior performance can be achieved by means of the positive synergy obtained when applying the well-known genetic tuning of the lateral position after the application of the IIVFDT method.
    Spanish Government TIN2011-28488 TIN2010-1505

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems.
    This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795

    SOUL: Scala Oversampling and Undersampling Library for imbalance classification

    This work has been supported by the research project TIN2017-89517-P, by the UGR research contract OTRI 3940 and by a research scholarship, given to the authors Nestor Rodriguez and David Lopez by the University of Granada, Spain.
    The improvements in technology and computation have promoted a global adoption of Data Science. It is devoted to extracting significant knowledge from large amounts of information by means of the application of Artificial Intelligence and Machine Learning tools. Among the different tasks within Data Science, classification is probably the most widespread overall. Focusing on the classification scenario, we often face datasets in which the number of instances for one of the classes is much lower than that of the remaining ones. This issue is known as the imbalanced classification problem, and it is mainly related to the need for boosting the recognition of the minority class examples. In spite of the large number of solutions that have been proposed in the specialized literature to address imbalanced classification, there is a lack of open-source software that compiles the most relevant ones in an easy-to-use and scalable way. In this paper, we present a novel software approach named SOUL, which stands for Scala Oversampling and Undersampling Library for imbalanced classification. The main capabilities of this new library include a large number of different data preprocessing techniques, efficient execution of these approaches, and a graphical environment to contrast the output for the different preprocessing solutions.

    Combinatorial Blood Platelets-Derived circRNA and mRNA Signature for Early-Stage Lung Cancer Detection

    The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms24054881/s1.
    Despite the diversity of the liquid biopsy transcriptomic repertoire, numerous studies exploit only a single RNA type signature for diagnostic biomarker potential. This frequently results in insufficient sensitivity and specificity to reach diagnostic utility. Combinatorial biomarker approaches may offer a more reliable diagnosis. Here, we investigated the synergistic contributions of circRNA and mRNA signatures derived from blood platelets as biomarkers for lung cancer detection. We developed a comprehensive bioinformatics pipeline permitting an analysis of platelet-circRNA and mRNA derived from non-cancer individuals and lung cancer patients. An optimal selected signature is then used to generate the predictive classification model using a machine learning algorithm. Using an individual signature of 21 circRNA and 28 mRNA, the predictive models reached an area under the curve (AUC) of 0.88 and 0.81, respectively. Importantly, combinatorial analysis including both types of RNAs resulted in an 8-target signature (6 mRNA and 2 circRNA), enhancing the differentiation of lung cancer from controls (AUC of 0.92). Additionally, we identified five biomarkers potentially specific for early-stage detection of lung cancer. Our proof-of-concept study presents the first multi-analyte-based approach for the analysis of platelets-derived biomarkers, providing a potential combinatorial diagnostic signature for lung cancer detection.
    European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie 765492

    Digital multiplexed analysis of circular RNAs in FFPE and fresh non-small cell lung cancer specimens

    We would like to thank Stephanie Davis for her language editing assistance. The investigators also wish to thank the patients for kindly agreeing to donate samples to this study. We thank all the physicians who collaborated by providing clinical information. Graphical Abstract, Figs 1A, 8A and Fig. S1 were created with Biorender.com. This project has received funding from a European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement ELBA No 765492.
    Although many studies highlight the implication of circular RNAs (circRNAs) in carcinogenesis and tumor progression, their potential as cancer biomarkers has not yet been fully explored in the clinic due to the limitations of current quantification methods. Here, we report the use of the nCounter platform as a valid technology for the analysis of circRNA expression patterns in non-small cell lung cancer (NSCLC) specimens. Under this context, our custom-made circRNA panel was able to detect circRNA expression both in NSCLC cells and formalin-fixed paraffin-embedded (FFPE) tissues. CircFUT8 was overexpressed in NSCLC, contrasting with circEPB41L2, circBNC2, and circSOX13 downregulation even at the early stages of the disease. Machine learning (ML) approaches from different paradigms allowed discrimination of NSCLC from nontumor controls (NTCs) with an 8-circRNA signature. An additional 4-circRNA signature was able to classify early-stage NSCLC samples from NTC, reaching a maximum area under the ROC curve (AUC) of 0.981. Our results not only present two circRNA signatures with diagnosis potential but also introduce nCounter processing following ML as a feasible protocol for the study and development of circRNA signatures for NSCLC.
    European Commission 76549